GRAPHICAL USER INTERFACE AND METHOD FOR MODIFYING
PRONUNCIATIONS IN TEXT-TO-SPEECH AND SPEECH RECOGNITION SYSTEMS 

COPYRIGHT NOTICE 

A portion of the disclosure of this patent document contains material which is
subject to copyright protection. The copyright owner has no objection to the facsimile
reproduction by anyone of the patent document or the patent disclosure, as it appears in the
Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights
whatsoever.

BACKGROUND OF THE INVENTION 
The invention disclosed herein relates generally to user interfaces and, in 
particular, to graphical user interfaces for use with text-to-speech and automated speech 
recognition systems. 

Voice or speech recognition and generation technology is gaining in importance 
as an alternative or supplement to other, conventional input and output devices. This will be 
particularly true with continued improvements and advances in the underlying software 
methodologies employed and in the hardware components which support the processing and 
storage requirements. As these technologies become more generally available to and used by the 
mass market, improvements are needed in the techniques employed in initializing and modifying 
speech recognition and generation systems. 

A few products exist which allow users to process files of text to be read aloud by 
synthesized or recorded speech technologies. In addition, there are software products used to 
process spoken language as input, identify words and commands, and trigger an action or event.
Some existing products allow users to add words to a dictionary, make modifications to word 
pronunciations in the dictionary, or modify the sounds created by a text-to-speech engine. 

However, users of these products are required to understand and employ 
specialized information about grammar, pronunciation, and linguistic rules of each language in 
which word files are to be created. Moreover, in some of these products the means of 
representing pronunciations requires mastery of a mark-up language with unique pronunciation 
keys not generally used in other areas. 

As a result, these products make text-to-speech and automated speech recognition 
technology inflexible and less accessible to the general public. They require users to become 
experts in both linguistic rules and programming techniques. The inflexibility arises in part 
because these products use general rules of the language in question to determine pronunciation 
without regard to context, such as geographic context in the form of dialects, or individual 
preferences regarding the pronunciation of certain words such as names. 

Further, the existing products generally provide less than satisfactory results in 
pronunciations or translations of pronunciations. The products do not perform well with respect 
to many types of words including acronyms, proper names, technological terms, trademarks, or 
words taken from other languages. Nor do these products perform particularly well in 
accounting for variations in pronunciations of words depending on their location in a phrase or 
sentence (e.g., the word "address" is pronounced differently when used as a noun as opposed to a 
verb). 

As a result, there is a need for a user interface method and system which expresses 
pronunciation rules and options in a simple way so that nonexpert users can take fuller advantage 
of the benefits of text-to-speech and speech recognition technologies. 

SUMMARY OF THE INVENTION

It is an object of the present invention to solve the problems described above with 
existing text-to-speech and speech recognition systems. 

It is another object of the present invention to provide a simple and intuitive user 
interface for setting and modifying pronunciations of words. 

It is another object of the present invention to provide for the use in text-to-speech 
and speech recognition systems of sounds or letter groups which are not typically used in or even 
violate the rules of a language. 

These and other objects of the invention are achieved by a method and user 
interface which allows users to make decisions about how to pronounce words and parts of 
words based on audio cues and common words with well known pronunciations. 

Thus, in some embodiments, users input or select words for which they want to 
set or modify pronunciations. To set the pronunciation of a given letter or letter combination in 
the word, the user selects the letter(s) and is presented with a list of common words whose 
pronunciations, or portions thereof, are substantially identical to possible pronunciations of the 
selected letters. Preferably the list of sample, common words is ordered based on frequency of 
correlation in common usage, the most common being designated as the default sample word, 
and the user is first presented with a subset of the words in the list which are most likely to be 
selected. 

In addition, embodiments of the present invention allow for storage in the 
dictionary of several different pronunciations for the same word, to allow for contextual 
differences and individual preferences. 

Further embodiments provide for the storage of multiple dictionaries for different
languages, but allow users to select pronunciations from various dictionaries to account for 
special words, parts of words, and translations. As a result, users may create and store words 
having any sound available to the system, even when the sound doesn't generally correspond 
with letters or letter groups according to the rules of the language. 

In addition to modifying pronunciations of letters in a word, embodiments of the 
present invention allow users to easily break words into syllables or syllable-like letter groupings 
or word subcomponents even when the rules of a given language do not provide for such 
groupings as syllables, and to specify which such syllables should be accented. As used herein, 
the word syllable refers to such traditional syllables as well as other groupings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The invention is illustrated in the figures of the accompanying drawings which are 
meant to be exemplary and not limiting, in which like references refer to like or corresponding 
parts, and in which: 

Fig. 1 is a block diagram of a system in accordance with embodiments of the 
present invention; 

Fig. 2 is a flow chart showing broadly a process of allowing users to modify word 
pronunciations in accordance with the present invention using the system of Fig. 1;

Figs. 3A-3B contain a flow chart showing in greater detail the process of allowing 
users to modify word pronunciations in accordance with an embodiment of the present invention; 

Fig. 4 is a flow chart showing a process of testing the pronunciation of a word;
and

Figs. 5-9 contain diagrams of screen displays showing the graphical user interface
of one embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 
Embodiments of the present invention are now described in detail with reference 
to the drawings in the figures. 

A text-to-speech ("TTS") and automated speech recognition ("ASR") system 10 is
shown in Fig. 1. The system 10 contains a computerized apparatus or system 12 having a 
microcontroller or microprocessor 14 and one or more memory devices 16. The system 10 
further has one or more display devices 18, speakers 20, one or more input devices 22 and a 
microphone 24. All such components are conventional and known to those of skill in the art and 
need not be further described here. 

Memory device or devices 16, which may be incorporated in computer apparatus 
12 as shown or may be remotely located from computer 12 and accessible over a network or 
other connection, store several programs and data files in accordance with the present invention. 
A pronunciation selection program 26, when executed on microcontroller 14, generates the
user interface described herein, processes a user's input, and retrieves
data from databases 28 and 30. Dictionary databases 28 are a number of databases or data
files, one for each language handled by the system 10, which store character strings and one or 
more pronunciations associated therewith. Pronunciation databases 30 are a number of databases 
or data files, one for each of the languages, containing records each having a character or 
character group and a number of sample words associated therewith which contain characters 
that are pronounced in a manner which is substantially identical to the way the characters may be 
pronounced. The sample words are selected in creating the pronunciation databases 30 based on
grammatical and linguistic rules for the language. Preferably, the sample words for each
character or character group (e.g., diphthong) are ordered generally from more common usage in
pronunciation of the character to less common. 

Although shown as two databases, the dictionary database 28 and pronunciation 
database 30 may be structured as one data file or in any other format which facilitates retrieval of 
the pronunciation data as described herein and/or which is required to meet the needs of a given 
application or usage. 
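
As an illustration only, the two kinds of records described above might be modeled as follows. This is a minimal sketch in Python, not the storage format of any actual embodiment: the class names, field names, and ARPAbet-style sound labels are assumptions, as are all sample entries other than the sample word "buy" for the letter "i", which is taken from Fig. 9.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DictionaryEntry:
    """A character string with one or more stored pronunciations (database 28)."""
    text: str
    pronunciations: List[str] = field(default_factory=list)


@dataclass
class SampleWord:
    """A common word whose sound illustrates one way to pronounce a letter group."""
    word: str
    sound: str  # hypothetical ARPAbet-style label, not taken from the source


# Pronunciation databases 30: one mapping per language, keyed by letter group.
# Sample words are listed from most common pronunciation to least common.
# All entries are illustrative assumptions.
PRONUNCIATION_DBS: Dict[str, Dict[str, List[SampleWord]]] = {
    "en": {
        "i": [SampleWord("buy", "AY"), SampleWord("bit", "IH")],
        "ch": [SampleWord("chair", "CH"), SampleWord("school", "K")],
    },
}


def default_sample(language: str, letters: str) -> SampleWord:
    """Return the first (most common) sample word for a letter group."""
    return PRONUNCIATION_DBS[language][letters][0]
```

Keeping the sample words in an ordered list means the default sample word is simply the first element, which matches the ordering preference stated above.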

The system 10 further contains a TTS module 32 and ASR module 34 stored in 
memory 16. These modules are conventional and known to those of skill in the art and include, 
for example, the ViaVoice® software program available from IBM. These modules 32 and 34 
convert text stored as digital data to audio signals for output by the speakers 20 and convert 
audio signals received through microphone 24 into digital data. The modules retrieve and utilize 
pronunciation data stored in the dictionary databases 28. 

A method for allowing users to easily modify the pronunciation data stored in 
dictionary databases 28, as performed by pronunciation selection program 26, is described 
generally in Fig. 2 and in greater detail in Figs. 3A-3B. Referring to Fig. 2, in accordance with 
the invention a character string, which may be a word, name, etc., is displayed on display device 
18, step 50. A user uses input devices 22 to select one or more letters from the string, step 52. 
As is understood, pronunciation variations may be linked to individual letters such as vowels or 
to groups of letters such as "ou", "ch", "th" or "gh". The program 26 queries pronunciation 
database 30 to retrieve the sample words associated with the selected letter or letter group, step 
54. If the letter or letter group is absent from the pronunciation database 30, an error message
may be sent or sample words for one of the letters may be retrieved. Some or all of the sample
words are displayed, step 56, and the user selects one of the words, step 58. The program 26
then generates pronunciation data for the character string using the sample word to provide a
pronunciation of the selected letter(s), step 60. The string and pronunciation data are stored in
the dictionary database 28, step 62, and the string may be audibly output by the
TTS module 32 or used to create a speaker verification or utterance for ASR module 34.
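
The retrieval and storage steps of Fig. 2 (steps 54 through 62) can be sketched roughly as shown here. This is a hedged Python illustration under stated assumptions: the function names, the tuple layout of a sample word, the sound labels, and the database contents are all hypothetical, and the fallback shown (using the first letter of the group that has an entry) is only one of the two options the text mentions.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (sample word, hypothetical sound label)

# Hypothetical pronunciation database for one language.
PRON_DB: Dict[str, List[Sample]] = {
    "i": [("buy", "AY"), ("bit", "IH")],
    "o": [("go", "OW"), ("hot", "AA")],
}


def retrieve_samples(letters: str, pron_db: Dict[str, List[Sample]]) -> List[Sample]:
    """Step 54: fetch sample words for the selected letter or letter group."""
    if letters in pron_db:
        return pron_db[letters]
    # Group absent: fall back to the first individual letter that has an entry.
    for letter in letters:
        if letter in pron_db:
            return pron_db[letter]
    # Otherwise report an error, the other option described in the text.
    raise KeyError(f"no sample words for {letters!r}")


def apply_choice(dictionary: Dict[str, List[dict]], string: str,
                 letters: str, sample: Sample) -> dict:
    """Steps 60-62: record the chosen sound for the letters and store it."""
    word, sound = sample
    entry = {"letters": letters, "sample_word": word, "sound": sound}
    dictionary.setdefault(string, []).append(entry)
    return entry
```

A caller would display the list returned by `retrieve_samples` (step 56), let the user pick one item (step 58), and pass that item to `apply_choice`.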

The process implemented by program 26 is described in more detail in Figs. 3A-3B.
An exemplary embodiment of a user interface used during this process is illustrated in Figs.
5-9. As shown in Fig. 5, interface 190 displayed on display device 18 contains: an input box 200
for manual input of or display of selected characters; a test button 202 which is inactive until a
word is selected; a modify button 204 which is similarly inactive until a word is selected; a
selection list 206 consisting of the choices "sound", "accent" and "syllable" (or a "grouping");
and a workspace 208.

As explained above, the system 10 preferably contains multiple dictionary and
pronunciation databases representing different languages. Referring now to Fig. 3A, a user
selects one of the languages, step 70, and the program 26 opens the dictionary for the selected
language, step 72. To select a word or other character string, a user can choose to browse the
selected dictionary, step 74, in which case the user selects an existing word from the database, step 76.
Otherwise, the user enters a word such as by typing into input box 200, step 78.

Next, the user can choose whether to test the pronunciation of the word, step 80,
by selecting test button 202. The process of testing a word pronunciation is described below
with reference to Fig. 4.

The user can choose to modify the word's pronunciation, step 82, by selecting
modify button 204. If not, the user can store the word and current pronunciation by selecting the 
"OK" button in dialog 190, step 84. If the word is not an existing word in the dictionary 
database 28, step 86, the word and pronunciation data are stored in the dictionary, step 88. As 
explained below with reference to Fig. 4, the pronunciation data for an unmodified word is 
generated using default pronunciations based on the rules of the selected language. If the word 
already exists, the new pronunciation data is stored with the word in the dictionary, step 90, and
the alternate pronunciations may be selected according to contextual circumstances.

If the user wishes to modify the pronunciation, the three choices in selection list 
206 are available. 

The selected word, now appearing in input box 200, is broken into individual 
characters and copied into workspace 208. See Fig. 6. Workspace 208 further shows syllable 
breaks (the dash in workspace 208) and accent marks (the apostrophes in workspace 208) for the 
current pronunciation. 

If the user selects to modify the syllable break, step 92, a breakpoint symbol 210
is displayed, see Fig. 7. The symbol 210 may be moved by the user to identify a desired syllable
breakpoint, step 94. The program 26 breaks any existing syllable into two syllables at the selected
breakpoint, step 96.
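
One way to picture step 96 is as a split of a list of syllable strings at a character offset. The following Python sketch is an assumption about representation only (the function name, the list-of-strings model, and the offset convention are hypothetical); the text does not specify how program 26 stores syllable breaks.

```python
from typing import List


def split_syllable(syllables: List[str], breakpoint: int) -> List[str]:
    """Split the syllable that contains character offset `breakpoint`.

    `breakpoint` counts characters from the start of the whole word, so a
    value of 2 places the break between the second and third characters.
    A breakpoint that falls on an existing boundary leaves the list unchanged.
    """
    result: List[str] = []
    start = 0
    for syllable in syllables:
        end = start + len(syllable)
        if start < breakpoint < end:
            cut = breakpoint - start
            result.extend([syllable[:cut], syllable[cut:]])
        else:
            result.append(syllable)
        start = end
    return result
```

For example, starting from the single grouping `["michael"]`, a break at offset 5 followed by a break at offset 2 yields `["mi", "cha", "el"]`.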

If the user selects to modify the accent, step 98, an accent type selection icon 
group 212 is displayed in interface 190 (see Fig. 8). The group 212 contains three icons: a 
primary accent (or major stress) icon 212a, a secondary accent (or minor stress) icon 212b, and a 
no accent (or unstressed) icon 212c. The user selects an accent level by clicking one of the icons, 
step 100. The user then selects a syllable, step 102, by, for example, selecting a box in
workspace 208 immediately following the syllable. The program 26 identifies the selected
syllable with the selected accent level, step 104, and may further adjust the remaining accents in
accordance with rules of the selected language. For example, if the language provides for only
one primary accented syllable, and the user selects a second syllable for a primary accent, the
program may change the first primary accent to a secondary accent, or may delete all remaining
accents entirely.
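
The accent adjustment described above, for a language that permits a single primary accent, might look like the following. This is a minimal Python sketch under assumptions: the constant names, the list-of-levels representation, and the `clear_others` flag are hypothetical and simply encode the two alternatives the text gives (demote the old primary accent, or remove all remaining accents).

```python
from typing import List

PRIMARY = "primary"
SECONDARY = "secondary"
UNSTRESSED = "none"


def set_accent(accents: List[str], index: int, level: str,
               clear_others: bool = False) -> List[str]:
    """Return a new accent list with `level` applied to syllable `index`.

    When a primary accent is assigned, any existing primary accent is
    demoted to secondary, or, if `clear_others` is true, every other
    syllable is marked unstressed.
    """
    updated = list(accents)
    if level == PRIMARY:
        if clear_others:
            updated = [UNSTRESSED] * len(updated)
        else:
            updated = [SECONDARY if a == PRIMARY else a for a in updated]
    updated[index] = level
    return updated
```

Returning a new list rather than mutating the input keeps the current pronunciation intact until the user chooses to store the change.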

Referring now to Fig. 3B, if the user selects to modify a letter sound in list 206, 
step 106, the user selects one or more letters in workspace 208, step 108. The program 26 
retrieves the sample words from the pronunciation database 30 for the selected language whose 
pronunciations, or portions thereof, are associated or linked with the selected letter(s), step 110. 
The words are displayed in word list 214, see Fig. 9. The sample word which represents the 
default pronunciation for the selected letter(s) is highlighted, step 112. See Fig. 9, in which the
sample word "buy" is highlighted in word list 214 for pronunciation of the selected letter "i". A 
user can also listen to the pronunciations of the sample words. As also shown in Fig. 9, only two 
or three of the sample words may be shown in word list 214, with an option for the user to see 
and hear additional words. 
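
The behavior of word list 214 (a short initial subset, a highlighted default, and an option to see more) can be sketched as a small view-model function. The Python below is illustrative only: the function name, the dictionary keys, and the sample words other than "buy" are assumptions, and it presumes the sample words arrive already ordered from most to least common.

```python
from typing import Dict, List, Optional, Union


def build_word_list(sample_words: List[str],
                    visible: int = 3) -> Dict[str, Union[List[str], Optional[str], bool]]:
    """Choose which sample words to show first and which one to highlight."""
    shown = sample_words[:visible]
    return {
        "shown": shown,                       # initial subset displayed
        "default": shown[0] if shown else None,  # highlighted default word
        "has_more": len(sample_words) > visible,  # whether to offer "more"
    }
```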

If the user selects one of the sample words, step 114, the pronunciation data, or 
portions thereof, for the selected word is associated with the letter(s) selected in the selected 
word contained in workspace 208, step 116. The modified word may then be modified further or
stored, in accordance with the process described above, or may be tested as described below. 

In accordance with certain aspects of the present invention, it is recognized that 
most languages including English contain words taken from other languages. Therefore, the user 
is given (e.g., in word list 214 after selecting "more") the option of selecting a pronunciation for 
the selected letters from another language, step 118. The user then selects the desired language,
step 120, and the program 26 retrieves sample words associated with the selected letter(s) from 
the pronunciation database file 30 for that selected language, step 122. The sample words are 
then presented for the user's selection as explained above. 

As a result, a simple and flexible process is achieved for allowing users to modify
word pronunciations. As one example of the ease and flexibility of the process, the word
"michael" selected in Fig. 9 may be modified from the English pronunciation, "MiK'-el" to a
Hebrew name "Mee-cha'-el" by adding a syllable break between the "a" and "e" and "i" and
"ch", placing the primary accent on the new syllable "cha," and selecting appropriate
pronunciations for the "i", "ch" (e.g., from the Hebrew language dictionary), "a" and "e" based
on common words. No grammatical or linguistic expertise is required.

The process of testing a word's pronunciation is shown in Fig. 4. If the word
already is contained in the dictionary database 28, step 140, the stored pronunciation is retrieved,
step 142. If more than one pronunciation exists for the word, the user may be prompted to select
one, or a default used. If the word is not yet present, then for each letter or letter group, if a user
has selected a pronunciation using the program 26, step 144, that pronunciation data is retrieved,
step 146, and otherwise a default pronunciation may be selected, step 148. When all letters have
been reviewed, step 150, the program 26 generates a pronunciation for the word using the
retrieved letter pronunciations, step 152. Finally the TTS module outputs an audible
representation of the retrieved or generated word pronunciation, step 154.
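
The decision logic of Fig. 4 (steps 140 through 152) can be approximated as follows. This Python sketch rests on several assumptions that are not in the text: pronunciations are modeled as space-separated sound labels, user selections are keyed by the position of the letter group in the word, the first stored pronunciation is treated as the default, and the audible output of step 154 is left to the caller.

```python
from typing import Dict, List


def pronunciation_for_test(word: str,
                           groups: List[str],
                           dictionary: Dict[str, List[str]],
                           user_choices: Dict[int, str],
                           defaults: Dict[str, str]) -> str:
    """Return the pronunciation that would be handed to the TTS module."""
    stored = dictionary.get(word)
    if stored:
        # Steps 140-142: the word exists; use the first stored pronunciation
        # as the default (prompting the user is the other option described).
        return stored[0]
    sounds: List[str] = []
    for position, group in enumerate(groups):
        if position in user_choices:
            sounds.append(user_choices[position])   # steps 144-146
        else:
            sounds.append(defaults[group])          # step 148
    return " ".join(sounds)                         # step 152
```

Keying the user's selections by position rather than by letter allows the same letter to be pronounced differently at two places in one word.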

Because the system described herein allows for multiple pronunciations for a 
single word, the TTS module must identify which pronunciation is intended for the word. The 
TTS module can identify the pronunciation based on the context in which the word is used. For 
example, the pronunciations may be associated with objects such as users on a network, such that
a message intended for a specific user would result in a correct selection of pronunciations. As 
another example, the TTS module may identify a word usage as noun vs. verb, and select the 
appropriate pronunciation accordingly. 
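
Context-based selection among several stored pronunciations could be sketched as a tag match. The Python below is a hypothetical illustration: the tag names ("pos", "user"), the informal respellings, and the rule of falling back to the first stored pronunciation are assumptions rather than details given in the text.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, Dict[str, str]]  # (pronunciation, required context tags)


def select_pronunciation(entries: List[Entry], context: Dict[str, str]) -> str:
    """Return the first pronunciation whose tags all match the context.

    If no entry matches, the first stored pronunciation is used as a default.
    """
    for pronunciation, tags in entries:
        if all(context.get(key) == value for key, value in tags.items()):
            return pronunciation
    return entries[0][0]


# Illustrative entries for the word "address"; the respellings are informal.
ADDRESS: List[Entry] = [
    ("AD-dress", {"pos": "noun"}),
    ("uh-DRESS", {"pos": "verb"}),
]
```

The same function covers the per-user case by tagging entries with, for example, `{"user": "..."}` and passing the message recipient in the context.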

While the invention has been described and illustrated in connection with 
preferred embodiments, many variations and modifications as will be evident to those skilled in 
this art may be made without departing from the spirit and scope of the invention, and the 
invention is thus not to be limited to the precise details of methodology or construction set forth 
above as such variations and modifications are intended to be included within the scope of the
invention. 